Try to compare different algorithms on pure single cell results. Both the 10x and C1 SmartSeq sections are using the 81,881 entry ENCODE4 bulk annotation set.
Cell ranger and STAR Solo only produce Gene level count quantifications. When processing 10x data kallisto and salmon alevin produce gene counts as well.
When processing C1 data, Kallisto & Salmon only produce transcript quantifications, so for C1 gene level comparisons, transcript level quantifications are created by grouping all the transcripts by their gene id and summing those groups.
This notebook reuses the correlation method developed for the ENCODE3 Bulk RNA pipeline
My implementation replicate_scores.
Original R version MAD.R
kallisto_em_common = scanpy.read_h5ad(analysis_dir / 'kallisto_em_filtered.h5ad')
kallisto_em_common
AnnData object with n_obs × n_vars = 6287 × 81881
var: 'gene_symbols'
for k in dense_10x:
print(k, dense_10x[k].shape)
cellranger (81881, 6287) solo (81881, 6287) alevin (81881, 6287) kallisto (81881, 6287) kallisto em (81881, 6287)
I have several slides where I wanted to get a sense of how the different algorithms performed on a cell.
What I settled on is a panel of 3 scatter plots that show the cell with the lowest correlation, the median correlation, and the best correlation between two selected algorithms.
Data points are in blue, spike-in values are in red and the line y=x is in green.
For 10x data there were no spike ins added, so all the spikes should be at 0,0 for naive, or min_threshold = $log_2 0.01$. Once in a while there's a read spuriously maps to the spike ins and so there might be one non-zero spike in the 10x data.
These plots are derived from ENCSR874BOF our e10.5 mouse forelimb dataset. They were analyzed with indicies built from the full ENCODE Bulk RNA-seq GTF file. This is unusual, many other 10x comparisons are done with a much smaller set that focuses on protein coding, lincRNA and antisense genes as well as the TR and IG annotations. A quick inspection suggests processed transcripts and pseudogenes appear to be the largest categories removed. See compare-10x-vs-ENCODE-gtf for more details.
The 10x cells selected were the intersection of Cellranger, STAR Solo, and Alevin methods, which happens to be the cells selected by STAR Solo.
plot_cell_correlation_histogram(
tenx_correlations,
'naive_spearman',
'Histogram of 10x per-cell {metric} correlations',
bins=50,
programs=['cellranger', 'solo', 'alevin', 'kallisto em']
)
plot_cell_correlation_histogram(
tenx_correlations,
'naive_spearman',
'Histogram of 10x per-cell {metric} correlations',
bins=50,
programs=['kallisto', 'kallisto em']
)
show_scatter_extremes(dense_10x, tenx_correlations, 'naive_spearman', 'cellranger', 'solo')
show_scatter_extremes(dense_10x, tenx_correlations, 'naive_spearman', 'cellranger', 'alevin')
show_scatter_extremes(dense_10x, tenx_correlations, 'naive_spearman', 'cellranger', 'kallisto')
show_scatter_extremes(dense_10x, tenx_correlations, 'naive_spearman', 'solo', 'alevin')
show_scatter_extremes(dense_10x, tenx_correlations, 'naive_spearman', 'solo', 'kallisto')
show_scatter_extremes(dense_10x, tenx_correlations, 'naive_spearman', 'alevin', 'kallisto')
plot_cell_correlation_histogram(
tenx_correlations,
'rafa_spearman',
'Histogram of 10x per-cell {metric} correlations',
bins=50,
programs=['cellranger', 'solo', 'alevin', 'kallisto em']
)
show_scatter_extremes(dense_10x, tenx_correlations, 'rafa_spearman', 'cellranger', 'solo')
show_scatter_extremes(dense_10x, tenx_correlations, 'rafa_spearman', 'cellranger', 'alevin')
show_scatter_extremes(dense_10x, tenx_correlations, 'rafa_spearman', 'cellranger', 'kallisto')
show_scatter_extremes(dense_10x, tenx_correlations, 'rafa_spearman', 'solo', 'alevin')
show_scatter_extremes(dense_10x, tenx_correlations, 'rafa_spearman', 'solo', 'kallisto')
show_scatter_extremes(dense_10x, tenx_correlations, 'rafa_spearman', 'alevin', 'kallisto')
See compare-kallisto-fragment-and-transcript-length for a deeper analysis. It turns out this version is wrong because the fragment size used was wrong.
plot_cell_correlation_histogram(c1_gene_count_correlations, 'naive_spearman','Histogram of per-cell {metric} C1 count correlations')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'kallisto', 'rsem')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'kallisto', 'salmon_decoy')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'kallisto', 'salmon')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'kallisto', 'star')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'rsem', 'salmon_decoy')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'rsem', 'salmon')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'rsem', 'star')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'salmon_decoy', 'salmon')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'salmon_decoy', 'star')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'salmon', 'star')
plot_cell_correlation_histogram(c1_gene_count_correlations, 'rafa_spearman','Histogram of per-cell {metric} C1 count correlations')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'kallisto', 'rsem')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'kallisto', 'salmon_decoy')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'kallisto', 'salmon')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'kallisto', 'star')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'rsem', 'salmon_decoy')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'rsem', 'salmon')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'rsem', 'star')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'salmon_decoy', 'salmon')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'salmon_decoy', 'star')
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'salmon', 'star')
plot_cell_correlation_histogram(c1_gene_tpm_correlations, 'naive_spearman','Histogram of per-cell {metric} C1 gene TPM correlations', bins=50)
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'naive_spearman', 'kallisto', 'rsem')
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'naive_spearman', 'kallisto', 'salmon_decoy')
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'naive_spearman', 'kallisto', 'salmon')
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'naive_spearman', 'rsem', 'salmon_decoy')
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'naive_spearman', 'rsem', 'salmon')
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'naive_spearman', 'salmon_decoy', 'salmon')
plot_cell_correlation_histogram(c1_gene_tpm_correlations, 'rafa_spearman','Histogram of per-cell {metric} C1 gene TPM correlations', bins=50)
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'rafa_spearman', 'kallisto', 'rsem')
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'rafa_spearman', 'kallisto', 'salmon_decoy')
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'rafa_spearman', 'kallisto', 'salmon')
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'rafa_spearman', 'rsem', 'salmon_decoy')
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'rafa_spearman', 'rsem', 'salmon')
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'rafa_spearman', 'salmon_decoy', 'salmon')
plot_cell_correlation_histogram(c1_transcript_count_correlations, 'naive_spearman','Histogram of per-cell {metric} C1 transcript count correlations', bins=50)
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'naive_spearman', 'kallisto', 'rsem')
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'naive_spearman', 'kallisto', 'salmon_decoy')
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'naive_spearman', 'kallisto', 'salmon')
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'naive_spearman', 'rsem', 'salmon_decoy')
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'naive_spearman', 'rsem', 'salmon')
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'naive_spearman', 'salmon_decoy', 'salmon')
plot_cell_correlation_histogram(c1_transcript_count_correlations, 'rafa_spearman','Histogram of per-cell {metric} C1 transcript count correlations', bins=50)
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'rafa_spearman', 'kallisto', 'rsem')
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'rafa_spearman', 'kallisto', 'salmon_decoy')
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'rafa_spearman', 'kallisto', 'salmon')
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'rafa_spearman', 'rsem', 'salmon_decoy')
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'rafa_spearman', 'rsem', 'salmon')
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'rafa_spearman', 'salmon_decoy', 'salmon')
plot_cell_correlation_histogram(c1_transcript_tpm_correlations, 'naive_spearman','Histogram of per-cell {metric} C1 transcript TPM correlations', bins=50)
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'naive_spearman', 'kallisto', 'rsem')
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'naive_spearman', 'kallisto', 'salmon_decoy')
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'naive_spearman', 'kallisto', 'salmon')
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'naive_spearman', 'rsem', 'salmon_decoy')
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'naive_spearman', 'rsem', 'salmon')
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'naive_spearman', 'salmon_decoy', 'salmon')
plot_cell_correlation_histogram(c1_transcript_tpm_correlations, 'rafa_spearman','Histogram of per-cell {metric} C1 transcript TPM correlations', bins=50)
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'rafa_spearman', 'kallisto', 'rsem')
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'rafa_spearman', 'kallisto', 'salmon_decoy')
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'rafa_spearman', 'kallisto', 'salmon')
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'rafa_spearman', 'rsem', 'salmon_decoy')
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'rafa_spearman', 'rsem', 'salmon')
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'rafa_spearman', 'salmon_decoy', 'salmon')